Welcome to this full write-up forecasting example using the nycflights13 data set. The solutions here follow the chapters in Forecasting: Principles and Practice, 3rd edition, by Rob J. Hyndman and George Athanasopoulos.

The chapters are:

1. Creating time series objects, plotting and autocorrelation
2. Time Series Decomposition
3. Time Series Features
4. The forecaster’s basic toolbox
5. Time series regression models
6. Exponential smoothing models
7. ARIMA models
8. Dynamic regression models
9. Forecasting hierarchical and grouped time series
10. Advanced forecasting methods

1. Creating time series objects, plotting and autocorrelation

We will begin by importing the required libraries into R.

library(nycflights13)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✓ lubridate   1.8.0     ✓ feasts      0.2.2
## ✓ tsibble     1.1.0     ✓ fable       0.3.1
## ✓ tsibbledata 0.3.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
library(clock)
## 
## Attaching package: 'clock'
## The following object is masked from 'package:lubridate':
## 
##     as_date

Our first step is to convert the nycflights13 data set into a tsibble so that time series forecasting can be done with the data. This includes building a date column, which makes the analysis much easier and faster.

flights <- nycflights13::flights %>% 
    mutate(date = date_build(year = year, month = month, day = day))

flight1 <- flights %>% 
  select(date, dep_time:distance)

Clean up the data

The data has a few missing points, so it needs to be cleaned up before we can do a full analysis. Specifically, there are 8,255 rows with missing departure time, departure delay, arrival time, arrival delay, and air time. The most likely explanation is that these flights were scheduled but did not fly. Let’s remove these rows from our data set.

# 8,255 rows with missing departure times
flight1 %>%  filter(is.na(dep_time))
## # A tibble: 8,255 × 14
##    date       dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <date>        <int>          <int>     <dbl>    <int>          <int>
##  1 2013-01-01       NA           1630        NA       NA           1815
##  2 2013-01-01       NA           1935        NA       NA           2240
##  3 2013-01-01       NA           1500        NA       NA           1825
##  4 2013-01-01       NA            600        NA       NA            901
##  5 2013-01-02       NA           1540        NA       NA           1747
##  6 2013-01-02       NA           1620        NA       NA           1746
##  7 2013-01-02       NA           1355        NA       NA           1459
##  8 2013-01-02       NA           1420        NA       NA           1644
##  9 2013-01-02       NA           1321        NA       NA           1536
## 10 2013-01-02       NA           1545        NA       NA           1910
## # … with 8,245 more rows, and 8 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>
# remove rows with missing departure times
flight1 <- flight1 %>% 
  filter(!is.na(dep_time))

There are still 2,808 NAs in the data set. It appears all these flights did fly. Let’s find these flights and see what we can do to clean everything up:

sum(is.na(flight1)) # 2808
## [1] 2808
# Let's see where the NAs are located:
flight1 %>% 
  filter(!complete.cases(.))
## # A tibble: 1,175 × 14
##    date       dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <date>        <int>          <int>     <dbl>    <int>          <int>
##  1 2013-01-01     1525           1530        -5     1934           1805
##  2 2013-01-01     1528           1459        29     2002           1647
##  3 2013-01-01     1740           1745        -5     2158           2020
##  4 2013-01-01     1807           1738        29     2251           2103
##  5 2013-01-01     1939           1840        59       29           2151
##  6 2013-01-01     1952           1930        22     2358           2207
##  7 2013-01-01     2016           1930        46       NA           2220
##  8 2013-01-02      905            822        43     1313           1045
##  9 2013-01-02     1125            925       120     1445           1146
## 10 2013-01-02     1848           1840         8     2333           2151
## # … with 1,165 more rows, and 8 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>
# Just three columns contain all 2,808 NAs. We'll build special data sets
# when making models to predict those values; otherwise we'll use the full
# data set for models of variables with complete data (such as departure time):
sum(is.na(flight1$arr_time), is.na(flight1$arr_delay), is.na(flight1$air_time)) # 2,808
## [1] 2808

Just remember to remove the rows with NAs first when building models that include arrival time, arrival delay, or air time.
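As a minimal illustration on a toy data frame (not the flights data), rows with NA in those columns can be dropped before modelling; the same `filter(!is.na(...))` idea applies to flight1:

```r
# Dropping rows with NA in specific columns, shown on a toy data frame
df <- data.frame(arr_time  = c(930, NA, 1210),
                 arr_delay = c(5, NA, -3),
                 air_time  = c(45, 60, NA))
complete <- df[!is.na(df$arr_time) & !is.na(df$arr_delay) & !is.na(df$air_time), ]
nrow(complete)  # 1
```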

Aggregate the data and create summary variables. A constant key column (key = 100) is added so it can serve as the key when we create the tsibble (next step).

flight1 <- flight1 %>% 
  group_by(date) %>% 
  summarise(
    flights = n(),
    mean_dep_time = mean(dep_time),
    mean_dep_delay = mean(dep_delay),
    mean_arrival_time = mean(arr_time, na.rm = TRUE),
    mean_arrival_delay = mean(arr_delay, na.rm = TRUE),
    mean_air_time = mean(air_time, na.rm = TRUE),
    mean_distance = mean(distance),
    key = 100
  ) %>% 
  ungroup()

Calculate the most common carrier, destination, and origin, and add those to the data set.

# Determine most common destination per day
mcd <- flights %>% 
  select(date, dest) %>% 
  group_by(date, dest) %>% 
  summarise(most_common_destination = n()) %>% 
  arrange(desc(most_common_destination)) %>% 
  filter(row_number() == 1) %>% 
  arrange(date)
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
flight1 <- flight1 %>% 
  mutate(most_common_destination = mcd$dest)

# Determine most common carrier per day
mcc <- flights %>% 
  select(date, carrier) %>% 
  group_by(date, carrier) %>% 
  summarise(most_common_carrier = n()) %>% 
  arrange(desc(most_common_carrier)) %>% 
  filter(row_number() == 1) %>% 
  arrange(date)
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
flight1 <- flight1 %>% 
  mutate(most_common_carrier = mcc$carrier)

# Determine most common origin per day
mco <- flights %>% 
  select(date, origin) %>% 
  group_by(date, origin) %>% 
  summarise(most_common_origin = n()) %>% 
  arrange(desc(most_common_origin)) %>% 
  filter(row_number() == 1) %>% 
  arrange(date)
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
flight1 <- flight1 %>% 
  mutate(most_common_origin = mco$origin)

Create a tsibble from which we can do forecasting:

flight1 <- flight1 %>% 
  as_tsibble(index = date, key = key) %>% 
  arrange(date) %>% 
  ungroup()

Create a plot of the number of flights per day:

flight1 %>% 
  autoplot(flights)

Create a plot for the mean departure time:

flight1 %>%
  autoplot(mean_dep_time)

Superimpose mean departure time and mean departure delay:

flight1 %>% 
  autoplot(mean_dep_time) +
  geom_line(aes(y = mean_dep_delay))

Exploring relationships between variables in our time series

flight1 %>% 
 ggplot(aes(x = mean_dep_time, y = mean_dep_delay)) +
  geom_point() +
  labs(x = "Average departure time",
       y = "Average departure delay")

Printing correlation between variables in our time series

flight1 %>% 
  GGally::ggpairs()
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Warning in cor(x, y): the standard deviation is zero

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Discussion: Most correlations are below 0.2, but a few are much larger. (The “standard deviation is zero” warnings come from the constant key column.)

The correlation between mean_distance and date is 0.473, which suggests that flights later in the year tended, on average, to cover longer distances.

The largest correlation, 0.944, is between mean arrival delay and mean departure delay. This is a strong linear relationship: in this data set, a flight that departs late very likely arrives late as well (although a correlation of 0.944 is not literally a 94% probability).

Lag plots

flight1 %>% 
  gg_lag(flights, geom = "point") +
  labs(x = "lag(flights, k)")

(from the text) Each graph shows \(y_t\) plotted against \(y_{t-k}\) for different values of \(k\)

Autocorrelation

(from the text) Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.

There are several autocorrelation coefficients, corresponding to each panel in the lag plot. For example, \(r_1\) measures the relationship between \(y_t\) and \(y_{t-1}\), \(r_2\) measures the relationship between \(y_t\) and \(y_{t-2}\), and so on.
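As a sketch independent of the flights data, the lag-\(k\) coefficient can be computed directly from its definition in base R and checked against acf():

```r
# Lag-k autocorrelation from its definition:
# r_k = sum_{t=k+1}^{T} (y_t - ybar)(y_{t-k} - ybar) / sum_{t=1}^{T} (y_t - ybar)^2
acf_at_lag <- function(y, k) {
  ybar <- mean(y)
  n <- length(y)
  sum((y[(k + 1):n] - ybar) * (y[1:(n - k)] - ybar)) / sum((y - ybar)^2)
}

y <- c(5, 7, 6, 8, 9, 11, 10, 12)
r1 <- acf_at_lag(y, 1)
# base R's acf() uses the same definition
isTRUE(all.equal(r1, acf(y, lag.max = 1, plot = FALSE)$acf[2]))  # TRUE
```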

flight1 %>% ACF(flights, lag_max = 10)
## # A tsibble: 10 x 3 [1D]
## # Key:       key [1]
##      key   lag     acf
##    <dbl> <lag>   <dbl>
##  1   100    1D  0.222 
##  2   100    2D -0.117 
##  3   100    3D -0.0607
##  4   100    4D -0.0696
##  5   100    5D -0.120 
##  6   100    6D  0.106 
##  7   100    7D  0.652 
##  8   100    8D  0.0905
##  9   100    9D -0.0937
## 10   100   10D -0.0482

We can plot the autocorrelation function:

flight1 %>% 
  ACF(flights) %>% 
  autoplot() +
  labs(title = "Flights out of New York City in 2013, by day")

The strongest lags are at 7, 14, and 21 days. This is clearly due to the weekly nature of the business. \(r_1\) is also noticeably positive because the number of flights on any given day is related to the number of flights on the previous day.

(from the text) The dashed blue lines indicate whether the correlations are significantly different from zero.

Trend and seasonality in our plot:

(from the text) When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.

When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags.

If the values of \(r_1\) through \(r_k\) all lie within the dashed blue lines, the results show white noise.
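A quick white-noise check along these lines (a sketch on simulated data; the bounds are \(\pm 1.96/\sqrt{T}\)):

```r
# For iid noise, roughly 95% of the r_k should fall inside +/- 1.96/sqrt(T)
set.seed(1)
y <- rnorm(200)
r <- acf(y, lag.max = 20, plot = FALSE)$acf[-1]  # drop lag 0
bound <- 1.96 / sqrt(length(y))
mean(abs(r) < bound)  # close to 1 for white noise
```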

2. Time Series Decomposition

Time series components

(from the text) If we assume an additive decomposition, then we can write:

\(y_t = S_t + T_t + R_t\)

where \(y_t\) is the data, \(S_t\) is the seasonal component, \(T_t\) is the trend-cycle component, and \(R_t\) is the remainder component, all at period t. Alternatively, a multiplicative decomposition would be written as:

\(y_t = S_t \times T_t \times R_t\)
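One consequence worth noting: taking logs turns a multiplicative decomposition into an additive one, since \(\log y_t = \log S_t + \log T_t + \log R_t\). A toy numeric check:

```r
# Multiplicative components become additive on the log scale
St <- 1.2; Tt <- 100; Rt <- 0.98
y <- St * Tt * Rt
isTRUE(all.equal(log(y), log(St) + log(Tt) + log(Rt)))  # TRUE
```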

Example: Decomposition of flight1

This will allow us to look at the decomposition of flights in the flight1 data set:

dcmp1 <- flight1 %>% 
  model(stl = STL(flights))
components(dcmp1)
## # A dable: 365 x 8 [1D]
## # Key:     key, .model [1]
## # :        flights = trend + season_week + remainder
##      key .model date       flights trend season_week remainder season_adjust
##    <dbl> <chr>  <date>       <int> <dbl>       <dbl>     <dbl>         <dbl>
##  1   100 stl    2013-01-01     838  852.        34.8    -48.7           803.
##  2   100 stl    2013-01-02     935  856.        40.4     38.7           895.
##  3   100 stl    2013-01-03     904  860.        66.8    -22.8           837.
##  4   100 stl    2013-01-04     909  863.        20.3     25.3           889.
##  5   100 stl    2013-01-05     717  867.      -194.      44.3           911.
##  6   100 stl    2013-01-06     831  869.       -31.3     -6.97          862.
##  7   100 stl    2013-01-07     930  872.        62.2     -3.91          868.
##  8   100 stl    2013-01-08     895  871.        36.3    -12.1           859.
##  9   100 stl    2013-01-09     897  870.        40.8    -13.7           856.
## 10   100 stl    2013-01-10     929  868.        67.6     -6.26          861.
## # … with 355 more rows

Note that we can do the same for any other variable in our data set, such as mean arrival time:

dcmp2 <- flight1 %>% 
  model(stl = STL(mean_arrival_time))
components(dcmp2)
## # A dable: 365 x 8 [1D]
## # Key:     key, .model [1]
## # :        mean_arrival_time = trend + season_week + remainder
##      key .model date       mean_arrival_time trend season_week remainder
##    <dbl> <chr>  <date>                 <dbl> <dbl>       <dbl>     <dbl>
##  1   100 stl    2013-01-01             1562. 1546.     5.58        10.5 
##  2   100 stl    2013-01-02             1533. 1543.    -0.00116    -10.3 
##  3   100 stl    2013-01-03             1536. 1540.    -1.37        -2.01
##  4   100 stl    2013-01-04             1519. 1537.   -45.2         27.6 
##  5   100 stl    2013-01-05             1509. 1534.    11.7        -36.4 
##  6   100 stl    2013-01-06             1573. 1531.    39.3          2.49
##  7   100 stl    2013-01-07             1516. 1529.    -9.28        -3.14
##  8   100 stl    2013-01-08             1534. 1525.     4.76         4.17
##  9   100 stl    2013-01-09             1523. 1521.    -0.0914       1.63
## 10   100 stl    2013-01-10             1523. 1519.    -1.61         6.31
## # … with 355 more rows, and 1 more variable: season_adjust <dbl>

Plot the data and the trend for the number of flights per day:

components(dcmp1) %>% 
  as_tsibble() %>% 
  autoplot(flights, colour = "gray") + 
  geom_line(aes(y = trend), colour = "#D55E00") +
  labs(
    y = "flights",
    title = "Number and trend of flights per day"
  )

Show all the components in a single figure:

components(dcmp1) %>% autoplot()

Seasonally adjusted data

From the text: If the seasonal component is removed from the original data, the resulting values are the “seasonally adjusted” data. For an additive decomposition, the seasonally adjusted data are given by \(y_t - S_t\), and for multiplicative data the seasonally adjusted values are obtained by \(\frac{y_t}{S_t}\)
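A toy check of the additive case (not the flights data): subtracting a known seasonal pattern recovers the rest of the series:

```r
# Seasonally adjusting an additive series: y - S leaves trend (+ remainder)
trend <- 10:15
S <- rep(c(2, -2), 3)           # period-2 pattern summing to zero
y <- trend + S                  # 12 9 14 11 16 13
y - S                           # recovers the trend: 10 11 12 13 14 15
```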

components(dcmp1) %>% 
  as_tsibble() %>% 
  autoplot(flights, colour = "gray") +
  geom_line(aes(y = season_adjust), colour = "#0072B2") +
  labs(
    y = "Flights",
    title = "Number and seasonally adjusted number of flights per day"
  )

Moving averages

(from the text) The classical method of time series decomposition originated in the 1920s and was widely used until the 1950s. It still forms the basis of many time series decomposition methods, so it is important to understand how it works. The first step in a classical decomposition is to use a moving average method to estimate the trend-cycle, so we begin by discussing moving averages.

Moving average smoothing

A moving average of order \(m\) can be written as:

\(\hat{T_t} = \frac{1}{m}\sum_{j = -k}^k y_{t+j}\)

where \(m = 2k+1\)
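The formula above can be implemented directly in base R for odd \(m\) (a sketch; the slider-based fpp3 code that follows performs the same computation on the flights data):

```r
# Centred moving average of odd order m = 2k + 1:
# T_hat_t = (1/m) * sum_{j=-k}^{k} y_{t+j}; NA where the window is incomplete
moving_average <- function(y, m) {
  stopifnot(m %% 2 == 1)
  k <- (m - 1) / 2
  out <- rep(NA_real_, length(y))
  for (t in (k + 1):(length(y) - k)) out[t] <- mean(y[(t - k):(t + k)])
  out
}

moving_average(c(1, 2, 3, 4, 5), 3)  # NA 2 3 4 NA
```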

Let us calculate and view a 7-day moving average of the flight data. We use a 7-day moving average because our data has weekly seasonality:

flight1 %>% 
  mutate(
    `7-MA` = slider::slide_dbl(flights, mean,
                               .before = 3, .after = 3, .complete = TRUE)
  ) %>% 
  autoplot(flights, colour = "gray") +
  geom_line(aes(y = `7-MA`), colour = "#D55E00")
## Warning: Removed 6 row(s) containing missing values (geom_path).

Classical decomposition

The classical decomposition method originated in the 1920s. It is a relatively simple procedure, and forms the starting point for most other methods of time series decomposition. There are two forms of classical decomposition: an additive decomposition and a multiplicative decomposition. These are described below for a time series with seasonal period \(m\) (\(m = 4\) for quarterly data, \(m = 12\) for monthly data, \(m = 7\) for daily data with a weekly pattern).

Additive decomposition

Step 1

If \(m\) is an even number, compute the trend-cycle component, \(\hat{T}_t\), using a 2 × m-MA. If \(m\) is an odd number, compute the trend-cycle component, \(\hat{T}_t\), using an m-MA.

Step 2

Calculate the detrended series: \(y_t - \hat{T}_t\)

Step 3

To estimate the seasonal component for each season, simply average the detrended values for that season. For example, with monthly data, the seasonal component for March is the average of all the detrended March values in the data. These seasonal component values are then adjusted to ensure that they add to zero. The seasonal component is obtained by stringing together these monthly values, and then replicating the sequence for each year of data. This gives \(\hat{S}_t\)

Step 4

The remainder component is calculated by subtracting the estimated seasonal and trend-cycle components: \(\hat{R}_t = y_t - \hat{T}_t - \hat{S}_t\)
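The four steps can be traced by hand on a toy quarterly series (a sketch of the procedure; classical_decomposition() below does this for the flights data):

```r
# Classical additive decomposition by hand for quarterly data (m = 4)
y <- c(10, 6, 4, 12, 11, 7, 5, 13, 12, 8, 6, 14)   # 3 years of quarterly data
m <- 4; n <- length(y)

# Step 1: m is even, so use a 2 x 4-MA; its weights are 1/8, 1/4, 1/4, 1/4, 1/8
w <- c(1, 2, 2, 2, 1) / 8
trend <- as.numeric(stats::filter(y, w, sides = 2))

# Step 2: detrend
detrended <- y - trend

# Step 3: average the detrended values by quarter, then centre to sum to zero
seas <- tapply(detrended, rep(1:m, length.out = n), mean, na.rm = TRUE)
seas <- seas - mean(seas)
seasonal <- rep(seas, length.out = n)

# Step 4: remainder
remainder <- y - trend - seasonal
```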

Example of classical additive decomposition

flight1 %>% 
  model(
    classical_decomposition(flights, type = "additive")
  ) %>% 
  components() %>% 
  autoplot() +
  labs(title = "Classical additive decomposition")
## Warning: Removed 3 row(s) containing missing values (geom_path).

Example of classical multiplicative decomposition

flight1 %>% 
  model(
    classical_decomposition(flights ~ season(12), type = "mult")
  ) %>% 
  components() %>% 
  autoplot() +
  labs(title = "Classical multiplicative decomposition")
## Warning: Removed 6 row(s) containing missing values (geom_path).

Comments from the text on why not to use classical decomposition:

While classical decomposition is still widely used, it is not recommended, as there are now several much better methods.

STL Decomposition

(from the text) STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess,” while loess is a method for estimating nonlinear relationships. The STL method was developed by R. B. Cleveland et al. (1990).

STL has several advantages over classical decomposition, and the SEATS and X-11 methods:

• Unlike SEATS and X-11, STL will handle any type of seasonality, not only monthly and quarterly data.
• The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
• The smoothness of the trend-cycle can also be controlled by the user.
• It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.

flight1 %>% 
  model(
    STL(flights ~ trend(window = 7) +
          season(window = "periodic"),
        robust = TRUE)) %>% 
  components() %>% 
  autoplot()

3. Time Series Features

The feasts package includes functions for computing FEatures And Statistics from Time Series (hence the name). We have already seen some time series features. For example, the autocorrelations discussed in Section 2.8 can be considered features of a time series — they are numerical summaries computed from the series.

Some simple statistics

This gives the standard deviation of the number of flights:

flight1 %>% features(flights, sd)
## New names:
## * `` -> ...1
## # A tibble: 1 × 2
##     key  ...1
##   <dbl> <dbl>
## 1   100  98.1

It’s also possible to calculate other statistics, such as the min, max, and quantiles of the data.

ACF features

(from the text) Autocorrelations were discussed in Section 2.8. All the autocorrelations of a series can be considered features of that series. We can also summarise the autocorrelations to produce new features; for example, the sum of the first ten squared autocorrelation coefficients is a useful summary of how much autocorrelation there is in a series, regardless of lag.

We can also compute autocorrelations of the changes in the series between periods. That is, we “difference” the data and create a new time series consisting of the differences between consecutive observations. Then we can compute the autocorrelations of this new differenced series. Occasionally it is useful to apply the same differencing operation again, so we compute the differences of the differences. The autocorrelations of this double differenced series may provide useful information.

The feat_acf() function computes a selection of the autocorrelations discussed here. It will return six or seven features:

• the first autocorrelation coefficient from the original data;
• the sum of squares of the first ten autocorrelation coefficients from the original data;
• the first autocorrelation coefficient from the differenced data;
• the sum of squares of the first ten autocorrelation coefficients from the differenced data;
• the first autocorrelation coefficient from the twice differenced data;
• the sum of squares of the first ten autocorrelation coefficients from the twice differenced data;
• for seasonal data, the autocorrelation coefficient at the first seasonal lag.
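The first two of these can be reproduced with base R's acf() (a sketch on a toy series; feat_acf() computes them internally):

```r
# acf1 = first autocorrelation; acf10 = sum of squares of the first ten
y <- sin(2 * pi * (1:60) / 12)                    # strongly autocorrelated toy series
r <- acf(y, lag.max = 10, plot = FALSE)$acf[-1]   # drop lag 0
acf1  <- r[1]
acf10 <- sum(r^2)
```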

Applying ACF Features to the flights data

flight1 %>% features(flights, feat_acf)
## # A tibble: 1 × 8
##     key  acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 season_acf1
##   <dbl> <dbl> <dbl>      <dbl>       <dbl>      <dbl>       <dbl>       <dbl>
## 1   100 0.222 0.541     -0.277       0.813     -0.504        1.15       0.652

STL Features

(from the text) The STL decomposition discussed in Chapter 3 is the basis for several more features.

A time series decomposition can be used to measure the strength of trend and seasonality in a time series. Recall that the decomposition is written as

\(y_t = S_t + T_t + R_t\)

where \(T_t\) is the smoothed trend component, \(S_t\) is the seasonal component, and \(R_t\) is a remainder component. For strongly trended data, the seasonally adjusted data should have much more variation than the remainder component. Therefore var(\(R_t\))/var(\(T_t + R_t\)) should be relatively small. But for data with little or no trend, the two variances should be approximately the same. So we define the strength of trend as:

\(F_{T} = max(0, 1 - \frac{var(R_t)}{var(T_t + R_t)})\)

This will give a measure of the strength of the trend between 0 and 1. Because the variance of the remainder might occasionally be even larger than the variance of the seasonally adjusted data, we set the minimal possible value of \(F_T\) equal to zero.

The strength of seasonality is defined similarly, but with respect to the detrended data rather than the seasonally adjusted data:

\(F_{S} = max(0, 1 - \frac{var(R_t)}{var(S_t + R_t)})\)
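These two formulas can be sketched directly on toy components (not an STL fit; feat_stl() below reports the same quantities for the flights data):

```r
# Strength of trend and seasonality from known components
trend_strength    <- function(Tt, Rt) max(0, 1 - var(Rt) / var(Tt + Rt))
seasonal_strength <- function(St, Rt) max(0, 1 - var(Rt) / var(St + Rt))

Tt <- seq(0, 10, length.out = 100)   # strong trend
St <- rep(c(2, -2), 50)              # strong period-2 seasonality
Rt <- rep(c(0.1, -0.1), 50)          # small remainder
trend_strength(Tt, Rt)               # close to 1
seasonal_strength(St, Rt)            # close to 1
```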

Example of STL Decomposition

flight1 %>% features(flights, feat_stl)
## # A tibble: 1 × 10
##     key trend_strength seasonal_strength_week seasonal_peak_we… seasonal_trough…
##   <dbl>          <dbl>                  <dbl>             <dbl>            <dbl>
## 1   100          0.450                  0.743                 0                5
## # … with 5 more variables: spikiness <dbl>, linearity <dbl>, curvature <dbl>,
## #   stl_e_acf1 <dbl>, stl_e_acf10 <dbl>

Exploring all the features of the feasts package, in one line of code

flight1 %>% 
  features(flights, feature_set(pkgs = "feasts")) %>% 
  print(width = 2000)
## Warning: `n_flat_spots()` was deprecated in feasts 0.1.5.
## Please use `longest_flat_spot()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: 1 error encountered for feature 20
## [1] The `fracdiff` package must be installed to use this functionality. It can be installed with install.packages("fracdiff")
## # A tibble: 1 × 48
##     key trend_strength seasonal_strength_week seasonal_peak_week
##   <dbl>          <dbl>                  <dbl>              <dbl>
## 1   100          0.450                  0.743                  0
##   seasonal_trough_week spikiness linearity curvature stl_e_acf1 stl_e_acf10
##                  <dbl>     <dbl>     <dbl>     <dbl>      <dbl>       <dbl>
## 1                    5      369.      197.     -410.      0.139       0.270
##    acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 season_acf1  pacf5
##   <dbl> <dbl>      <dbl>       <dbl>      <dbl>       <dbl>       <dbl>  <dbl>
## 1 0.222 0.541     -0.277       0.813     -0.504        1.15       0.652 0.0969
##   diff1_pacf5 diff2_pacf5 season_pacf zero_run_mean nonzero_squared_cv
##         <dbl>       <dbl>       <dbl>         <dbl>              <dbl>
## 1       0.905        1.22       0.622             0             0.0119
##   zero_start_prop zero_end_prop lambda_guerrero kpss_stat kpss_pvalue pp_stat
##             <dbl>         <dbl>           <dbl>     <dbl>       <dbl>   <dbl>
## 1               0             0            2.00     0.747        0.01   -14.8
##   pp_pvalue ndiffs nsdiffs bp_stat bp_pvalue lb_stat lb_pvalue var_tiled_var
##       <dbl>  <int>   <int>   <dbl>     <dbl>   <dbl>     <dbl>         <dbl>
## 1      0.01      1       1    18.0 0.0000225    18.1 0.0000208         0.809
##   var_tiled_mean shift_level_max shift_level_index shift_var_max shift_var_index
##            <dbl>           <dbl>             <dbl>         <dbl>           <dbl>
## 1          0.192            164.                45        60503.              42
##   shift_kl_max shift_kl_index spectral_entropy n_crossing_points
##          <dbl>          <dbl>            <dbl>             <int>
## 1         26.9             40            0.804               132
##   longest_flat_spot stat_arch_lm
##               <int>        <dbl>
## 1                 5        0.224

4. The forecaster’s basic toolbox

A tidy forecasting workflow.

Tidy workflow

The steps in a tidy forecasting workflow:

  1. Data preparation
  2. Plot the data (visualize)
  3. Define the model (specify)
  4. Train the model (estimate)
  5. Check model performance (evaluate)
  6. Produce forecasts (forecast)

Some simple forecasting methods: mean, naïve, seasonal naïve
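Before fitting them with fable, the three benchmark methods can be sketched in base R (hypothetical helper functions, shown only to make the definitions concrete):

```r
# Mean: forecast the average of the series; naive: repeat the last value;
# seasonal naive: repeat the last full season (period m)
mean_fc   <- function(y, h) rep(mean(y), h)
naive_fc  <- function(y, h) rep(tail(y, 1), h)
snaive_fc <- function(y, h, m) {
  last_season <- tail(y, m)
  last_season[((seq_len(h) - 1) %% m) + 1]
}

y <- c(10, 20, 30, 11, 21, 31)       # toy series with period m = 3
mean_fc(y, 2)                        # 20.5 20.5
naive_fc(y, 2)                       # 31 31
snaive_fc(y, 4, m = 3)               # 11 21 31 11
```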

# Set the training data from January 1 through September 30
train <- flight1 %>% 
  filter(date < "2013-10-01")
new_data <- flight1 %>% 
  filter(date >= "2013-10-01")

# Fit the models

flights_fit <- train %>% 
  model(
    Mean = MEAN(flights),
    `Naïve` = NAIVE(flights),
    `Seasonal Naïve` = SNAIVE(flights)
  )

# Generate forecasts
flight_forecast_1 <-  
  flights_fit %>%
  forecast(new_data = new_data) %>% 
  as_tsibble(key = key, index = date)

# Plot forecasts against actual values

flight_forecast_1 %>% 
  autoplot(.mean) +
  autolayer(new_data, flights, colour = "black") +
  labs(y = "Flights",
       title = "Number of flights predicted by mean, naïve, and seasonal naïve") +
  guides(colour = guide_legend(title = "forecast"))